Abstract To learn semantic attributes, existing methods typically
train one discriminative model for each word in a vocabulary of
nameable properties. However, this “one model per word” assumption
is problematic: while a word might have a precise linguistic
definition, it need not have a precise visual definition. We propose
to discover shades of attribute meaning. Given an attribute name, we
use crowdsourced image labels to discover the latent factors
underlying how different annotators perceive the named concept. We
show that structure in those latent factors helps reveal shades, that
is, interpretations for the attribute shared by some group of
annotators. Using these shades, we train classifiers to capture the
primary (often subtle) variants of the attribute. The resulting
models are both semantic and visually precise. By catering to users’
interpretations, they improve attribute prediction accuracy on novel
images. Shades also enable more successful attribute-based image
search, by providing robust personalized models for retrieving
multi-attribute query results. They are widely applicable to tasks
that involve describing visual content, such as zero-shot category
learning and organization of photo collections.

Keywords Attribute learning and perception • Vision and 
language • Attribute discovery 


1 Introduction 


Attributes are semantic properties of objects and scenes. They can
correspond to textures, materials, functional affordances, parts,
moods, or other human-understandable aspects (Ferrari and Zisserman
2007; Lampert et al 2009; Farhadi et al 2009; Parikh and Grauman
2011b; Kumar et al 2011). For instance, a scene can be “manmade”, or
one shoe can be “more formal” than another. By injecting language
into visual analysis, attributes broaden the visual recognition
problem from labeling images to describing them. This linguistic
interpretability opens up several interesting applications. For
example, a user can search for an image by describing it (Vaquero
et al 2009; Kumar et al 2011; Siddiquie et al 2011; Kovashka et al
2012; Scheirer et al 2012); train an object model by describing the
category (Lampert et al 2009; Parikh and Grauman 2011b; Kovashka
et al 2011; Parkash and Parikh 2012); or help the system perform
fine-grained recognition by naming the object’s properties (Branson
et al 2010).

Typically one defines a vocabulary of attribute words relevant to the
domain at hand, e.g., a vocabulary of facial characteristics for
people search (Kumar et al 2011), textures and parts for animals
(Lampert et al 2009; Wang et al 2009; Branson et al 2010), or
clothing properties for shopping (Berg et al 2010; Kovashka et al
2012). Then one gathers labeled images depicting each attribute in
the vocabulary, and trains a model to recognize each word.


The problem with this standard approach, however, is that there is
often a gap between language and visual perception. In particular,
the words in an attribute vocabulary need not be visually precise. An
attribute word may connote multiple “shades” of meaning, whether due
to polysemy, variable context-specific meanings, or differences in
humans’ perception. For instance, the attribute open can describe a
door that’s ajar, a fresh countryside scene, a peep-toe high heel, or
a backless clog.^1 Each shade is distinct and may require
dramatically different visual cues to correctly capture. Thus, the
standard approach of learning a single classifier for the attribute
as a whole may break down.


^1 Note that multiple shades of an attribute may exist even within a
specific object category (like shoes, in this example).

Fig. 1 Our method uses the crowd to discover factors responsible for an attribute’s presence, then learns predictive models based on those visual
cues. For example, for the attribute open, our method will discover multiple shades of meaning, e.g., peep-toed (open at toe) vs. slip-on (open at
heel) vs. sandal-like (open at toe and heel), which are three visual definitions of openness. Since these shades are not coherent in terms of their
global image descriptors, they would be difficult to discover using traditional image clustering. Discovering attribute shades requires both visual
cues and semantics.


Humans often form “schools of thought” based on how they interpret
and use particular visual attributes. This problem is studied in work
on linguistic relativity (Everett 2013), which examines how language
affects perception and how cultural differences influence how people
describe objects, shape properties of animals, colors, etc. Colors
are the quintessential example: e.g., Russian has two words for what
would be shades of “blue” in English, while other languages do not
strongly distinguish “blue” and “green”. In other words, if asked
whether an object in some image is “blue” or not, people of different
countries might be grouped around different answers, namely the
shades of the attribute. According to linguistic relativity, speakers
of different languages might also exhibit different behavior in tasks
involving localization, positioning and classification of objects
(Levinson 1996; Lucy 1992).


In addition to language-based factors, attribute use might also
differ due to cultural factors. For example, a person who lives in
the countryside might have a higher threshold for scene “naturalness”
or lower threshold for scene “clutter”. Further, judgments of how
“conservative” or “comfortable” a clothing item is might vary between
different countries or even regions within the same country. For many
attributes, such ambiguities in language use cannot be resolved by
adjusting the attribute definitions, since people use the same
definition differently.


Unfortunately, neither bottom-up attribute “discovery” nor relative
attributes solve the problem. Unsupervised discovery methods detect
clusters or splits in the low-level image descriptor space (Parikh
and Grauman 2011a; Rastegari et al 2012; Yu et al 2013). While they
might discover finer-grained shades of some property, they need not
be human-nameable (semantic). Furthermore, discovery methods are
intrinsically biased by the choice of features. For example, the set
of salient splits in color histogram space will be quite different
than those discovered in a dense SIFT feature space. Similarly,
unsupervised methods that cluster global image descriptors have no
way to intelligently focus on only localized regions of the image,
yet an attribute may occupy an arbitrarily small part of an image.

Relative attributes (Parikh and Grauman 2011b) do not address the
existence of shades, either. They represent whether an image has a
property “more” or “less”. The point in relative attributes is that
people may agree best on comparisons or strengths, not binary labels.
However, just like categorical attributes, relative attributes assume
that there is some single, common interpretation of the property
shared consistently by all human viewers, namely, that a single
ordering of images from least to most [attribute] is possible. Thus,
shades are relevant whether the attributes are modeled with
classifiers (binary) or ranking functions (relative).

Our goal is to automatically discover the shades of an attribute. An
attribute shade is a visual interpretation of an attribute name that
one or more people apply when judging whether that attribute is
present in an image.

Similarly, if learning relative attributes, a shade is an
interpretation when judging whether that attribute is present more in
image A or image B. See Figure 1.

Given a semantic attribute name, we want to discover its multiple
visual interpretations and train a discriminative model for each one.
Rather than attempt to manually enumerate the possible shades, we
propose to learn them indirectly from the crowd. First we ask many
annotators to label various images, reporting whether the attribute
is present or not. Using their responses, we estimate latent factors
that represent the annotators in terms of the kinds of visual cues
that they associate with the attribute. Then, clustering in the
low-dimensional latent space, we identify the schools of thought
(about how to interpret this attribute) underlying the discrete set
of labels the annotators provided. (We use the terms “school” and
“shade” interchangeably.) Finally, we use the positive exemplars in
each school to train a predictive model, which can then detect when
the particular attribute shade is present in novel images.

The resulting models are both semantic and visually precise. By
discovering the shades from the crowd’s latent factors, we isolate
the features corresponding to the perceived shades. This makes our
method less susceptible to the more “obvious” splits in the feature
space that an image clustering approach (including today’s
sophisticated discovery methods such as Parikh and Grauman 2011a;
Rastegari et al 2012; Yu et al 2013) may find, which need not
directly support the semantic attribute of interest.

Note that work in automatically finding the multiple senses of a
polysemous word (Barnard et al 2006; Loeff et al 2006; Saenko and
Darrell 2008; Berg and Forsyth 2006) is orthogonal to our goal, as it
focuses on nouns (object categories), not descriptive properties.
Further, the visual differences of polysemous nouns are usually stark
(e.g., a river bank or financial bank). In contrast, attribute shades
are often subtle differences in interpretation. We study the problem
of automatically discovering shades of adjectives, and determining
which shade of an adjective a user employs when judging whether a
visual property is present or not in a particular image.

On two datasets, we find that not only are the discovered shades
visually meaningful, they are also well-aligned with annotators’
textual explanations of their labels. Most importantly, we show their
practical utility to reliably estimate perceived attributes in novel
images, which is crucial for any application relying on the
descriptive nature of attributes (e.g., image search or zero-shot
learning).


2 Related Work 

Learning attributes Attributes are nameable visual properties that
can aid both classification (Lampert et al 2009; Farhadi et al 2009;
Branson et al 2010; Wang and Mori 2010; Parikh and Grauman 2011b;
Patterson and Hays 2012) and image search (Kumar et al 2011; Vaquero
et al 2009; Kovashka et al 2012; Siddiquie et al 2011; Scheirer et al
2012). Whether categorical or relative, prior work assumes that each
attribute word corresponds to one coherent visual property, and so
trains one classifier (Ferrari and Zisserman 2007; Kumar et al 2011;
Lampert et al 2009; Farhadi et al 2009; Vaquero et al 2009; Branson
et al 2010; Wang and Mori 2010; Patterson and Hays 2012) or one
ranking function (Parikh and Grauman 2011b; Kovashka et al 2012) per
attribute.

Since annotators may disagree about the attribute label for an image
(Farhadi et al 2009; Endres et al 2010; Patterson and Hays 2012;
Curran et al 2012), the norm is to take the majority vote label (and
discard the image if votes are too split). Thus, prior work treats
differences in attribute perception as noise. To our knowledge, the
only exception is our transfer learning approach (Kovashka and
Grauman 2013), which trains user-specific models for personalized
image search. In that work, we adapt a generic model for an attribute
using training data from each individual user, and the method
produces one attribute classifier for each user. In contrast, in this
work we discover schools of thought among the crowd, and our method
produces a set of attribute shades capturing commonly perceived
variations. These schools of thought are a valuable midpoint on the
spectrum from purely consensus models to purely user-specific models,
resulting in better accuracy for perceived attributes (cf. Sec. 3.4).
Shades also have broader utility than the adapted user-specific
models (Kovashka and Grauman 2013), since they let us explicitly
organize perceived properties.

Distinction with relative attributes We stress that relative
attributes (Parikh and Grauman 2011b), while avoiding the need for
forced categorical judgments, still assume a single underlying visual
property exists. They do not represent multiple interpretations. For
example, relative attributes construct a universal model for “less
brown” vs. “more brown”. They do not address the issue that one
person may say “image X is browner than Y”, while another may say the
opposite. Shades, on the other hand, are concerned with discovering
multiple models for varying perceptions of brown, e.g., chocolate
brown vs. goldish brown. The two goals are orthogonal. In fact, while
we study categorical attributes, the proposed approach could easily
be applied to discover shades of relative attributes; the label
matrix in Sec. 3.2 would simply record whether the person finds a
first image to exhibit the attribute more or less than a second
image.

Defining attribute vocabularies Most work defines the attribute
vocabulary manually, or by eliciting discriminative properties from
annotators (Patterson and Hays 2012; Maji 2012). However, in some
cases it is possible to generate it (semi-)automatically, as in (Wang
et al 2009; Branson et al 2010; Berg et al 2010; Parikh and Grauman
2011a; Rohrbach et al 2012). For animal species, field guides are a
natural source of attribute names (Wang et al 2009; Branson et al
2010). Given their focus on concrete parts, such domains are less
prone to shades. When suitable text sources are available, such as
captioned images on web pages (Berg et al 2010) or activity scripts
(Rohrbach et al 2012), one can mine for candidate attribute words.
Since not all words will be visually detectable, some work aims to
prune the vocabulary automatically (Berg et al 2010; Barnard and
Yanai 2006). Rather than mined text, our shades use sparse crowd
labels to capture latent interpretations of an attribute, which may
not be concisely describable with a keyword.

Discovering non-semantic attributes While the term “attribute”
typically connotes a semantic property, some researchers also use the
term to refer to discovered non-semantic features (Mahajan et al
2011; Rastegari et al 2012; Sharmanska et al 2012; Yu et al 2013).
The idea is to identify “splits” or clusters in the low-level image
descriptor space, often subject to constraints that deter redundancy
and promote discriminativeness for object recognition. However, being
bottom-up, there is no guarantee the splits will correspond to a
nameable property. Hence, unlike our shades, they are non-semantic
and inapplicable to descriptive attribute tasks, like image search or
zero-shot learning. One can attempt to assign names to discovered
“attributes” after the fact, as in (Parikh and Grauman 2011a; Duan
et al 2012; Yu et al 2013), but the patterns that are even
discoverable remain biased by the chosen low-level image feature
space, as discussed above. Semantics and human interpretability are
essential if human users are to use attributes to communicate with a
vision system.


Polysemy and domain adaptation A polysemous word has multiple
“senses” or meanings. Some work bridging text and visual analysis
aims to cluster Web images according to distinct senses (Barnard
et al 2006; Loeff et al 2006; Saenko and Darrell 2008; Berg and
Forsyth 2006). Other approaches find within-category modes in order
to perform better domain adaptation for object recognition (Hoffman
et al 2012; Gong et al 2013; Xiong et al 2014). These works are
orthogonal to our goal, as they focus on nouns and object categories,
not descriptive properties. Typically the visual differences between
senses of a polysemous word (or surrounding text context) are much
larger than between attribute shades of meaning. Distinctions between
attributes, on the other hand, are more subtle, and they are tied to
semantics more so than to visual differences. Furthermore, unlike a
truly polysemous word, for which one can enumerate the multiple
dictionary definitions, attribute shades are often more difficult to
definitively express in language. We show how to automatically infer
them from trends in crowd labels.


Aggregating crowd labels Crowd input has been aggregated in novel
ways for image clustering (Gomes et al 2011), image similarity (Tamuz
et al 2011), and object labeling (Welinder et al 2010). Welinder
et al. model annotators’ competence and bias to discover their
schools of thought, and subsequently undo their biases to produce
more reliable ground truth. While that work aims to recover a single
true label for each image, our goal is to discover the crowd’s
multiple interpretations of a label.

Our method makes use of an existing matrix factorization algorithm
(Salakhutdinov and Mnih 2008). Matrix factorization is often used for
matrix completion, to solve collaborative filtering problems (e.g.,
the Netflix challenge) by exploiting commonalities among users
(Salakhutdinov and Mnih 2008; Xiong et al 2010). Rather than impute
missing labels, we propose to use the latent factors themselves to
represent the interplay between language, human perception, and image
examples. Furthermore, we show how to use the recovered schools of
thought to build content-based attribute models.


3 Approach 

In order to discover shades of attributes, we first recover the
latent factors that motivate a user’s annotations of an image with a
given attribute’s presence or absence. We then represent each user in
this latent space, and discover groupings among users. Each group or
school is mapped to the images which are most frequently believed to
contain the attribute, according to the corresponding shade of the
attribute. Using these images, we learn models that predict whether
the attribute is present or not in a novel image, for some
school/shade.

We first explain the crowdsourced label collection in Section 3.1.
Then we describe how we recover the latent factors responsible for
those labels (Section 3.2) and use them to discover attribute shades
(Section 3.3). Finally, we exploit the discovered shades to improve
attribute prediction by accounting for the users’ varying
interpretations (Section 3.4).


3.1 Collecting Crowd Labels per Attribute 


We use two datasets: Shoes (Berg et al 2010; Kovashka et al 2012) and
SUN Attributes (Patterson and Hays 2012). While attribute labels are
available for both, our method needs to record which annotator
labeled which image. Thus, we run our own crowdsourced label
collection.

To focus our study on plausibly “shaded” words, we select 12
attributes that can be defined concisely in language, yet may vary in
their visual instantiations. This helps ensure that variance in the
annotators’ labels stems from the attribute’s visual sub-meanings, as
opposed to external factors like the annotator’s personal taste. The
12 attributes are: “pointy”, “open”, “ornate”, “comfortable”,
“formal”, “fashionable”, “brown” (for Shoes); and “cluttered”,
“soothing”, “open area”, “modern”, “rustic” (for SUN). We obtain
definitions of the attributes from a web dictionary, and show these
in Table 1.

In general, we choose words whose application in conversation
requires some interpretation of the definition. This interpretation
can revolve around judging thresholds and establishing what factors
cause the definition to hold. For example, for the “open area”
attribute, one is required to judge what constitutes “unobstructed
passage”; for “open”, how many (and how big) gaps there are; for
“ornate”, which patterns matter; for “comfortable”, what aspect of
the shoe causes comfort. We also choose words whose presence or
absence involves personal knowledge or beliefs; e.g., for “rustic”,
one should determine what country life is like.

Our decision to focus on words likely to have shaded meanings lets us
examine the problem at hand most directly. However, even if some
attributes in the pool turn out to be fairly precise visually, our
method is capable of returning few shades or just one shade, since we
employ automatic model selection. Thus, applying the shades discovery
algorithm we propose to a “less shaded” word should in principle do
no harm.

Attribute      Dictionary definition
Pointy         having a comparatively sharp point, or having numerous pointed parts
Open           having interspersed gaps, spaces, or intervals
Ornate         made in an intricate shape or decorated with complex patterns
Comfortable    providing physical comfort, ease and relaxation
Formal         designed for wear or use at elaborate ceremonial or social events
Brown          the color of, for example, chocolate and coffee
Fashionable    conforming to the current fashion; stylish; trendy; modern
To clutter     to make disorderly or hard to use by filling or covering with objects
To soothe      to bring comfort, composure, or relief
Open (area)    affording unobstructed passage or view
Modern         characteristic or expressive of recent times or the present; contemporary
Rustic         of, relating to, or typical of country life or country people

Table 1 The 12 attribute definitions shown to annotators.

We sample N = 250 to 1000 images per attribute. To get representative
images spanning the dataset, we cluster all images using K-means,
then sample ones near the cluster centers.^2 This yields a total of
2559 images for Shoes and 2086 images for SUN.

We build a Mechanical Turk interface to gather the labels. Workers
are shown definitions of the attributes (Table 1) as part of the task
instructions. These instructions are visible during task completion.
However, workers are shown no example images. Thus, they all receive
the same linguistic definition, but they are not prompted with any
particular visual definition. Then, given an image, the worker must
provide a binary label, i.e., he or she must state whether the image
does or does not possess a specified attribute. Additionally, for a
random set of five images, the worker must explain his label in
free-form text, and state which image most has the attribute, and
why. These questions both slow the worker down, helping quality
control, and also provide valuable ground truth data for evaluation,
as we will explain in Section 4.3.

^2 For “brown”, we sample images with high scores output by a “brown”
classifier. This attribute is rare, so sampling cluster centers would
produce very few brown images.

Our latent factor model (defined next) can accommodate imbalanced and
sparse labels. This is good, because in realistic scenarios, labels
may not originate from concentrated one-time labeling efforts (like
ours), but rather as a side product of another task, such as click
data in image search. In such a case, the images that one user labels
will not entirely overlap with those that another user labels.
Furthermore, each user will label few examples. To mimic this
scenario, we gather labels in a sparse fashion. Each worker labels 50
randomly chosen images, per attribute. To help ensure
self-consistency in the labels, we exclude workers who fail to
consistently answer three repeated questions sprinkled among the 50.
This yields annotations from 195 workers per attribute on average.
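
The self-consistency filter can be sketched as follows. This is a minimal illustration, not the procedure actually used for the data collection: the response format (tuples of worker, image, label) and the rule that a single contradictory answer on a repeated image excludes the worker are assumptions made for the example.

```python
from collections import defaultdict

def consistent_workers(responses):
    """Return the set of worker ids whose repeated answers all agree.

    responses: iterable of (worker_id, image_id, label), label in {0, 1}.
    An image shown to the same worker more than once acts as a repeated
    question. Assumption: one disagreement on a repeat excludes the worker.
    """
    answers = defaultdict(lambda: defaultdict(set))
    for worker, image, label in responses:
        answers[worker][image].add(label)
    return {
        worker
        for worker, per_image in answers.items()
        if all(len(labels) == 1 for labels in per_image.values())
    }

# Hypothetical responses: w1 repeats img_a consistently, w2 contradicts itself.
responses = [
    ("w1", "img_a", 1), ("w1", "img_b", 0), ("w1", "img_a", 1),
    ("w2", "img_a", 1), ("w2", "img_c", 0), ("w2", "img_a", 0),
]
kept = consistent_workers(responses)
filtered = [r for r in responses if r[0] in kept]
```

Only the responses of workers in `kept` would then be passed on to build the label matrix.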

While multiple workers may label the same image, we stress their
labels are not aggregated to create a majority vote “ground truth”.
The main premise of shades is that attribute names can be visually
imprecise and so admit multiple interpretations. The same attribute
word can have different meanings to different people, even if they
all know the same linguistic definition of the word. (Contrast this
with object category names, which are relatively precise.) Thus,
rather than discard label discrepancies as noise, we use them to
discover shades.

3.2 Recovering Latent Factors for Attribute Labels 

Now we use the label data to discover latent factors, which are
needed to recover the shades of meaning. Note that we learn factors
for each attribute independently, so all variables below are
attribute-specific. From the above data collection, we retain each
worker’s ID, the indices of images he labeled, and how he labeled
them. Let $M$ denote the number of unique annotators, and let $N$
denote the number of images seen by at least one annotator. Let $L$
be the $M \times N$ label matrix, where $L_{ij} \in \{0, 1, ?\}$ is a
binary attribute label for image $j$ by annotator $i$. A ? denotes an
unlabeled example. The matrix is only partially observed, as on
average only 20% of the possible image-worker pairs are labeled.
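
As an illustration of this data structure, the sketch below assembles a partially observed label matrix from raw responses, storing `None` for the “?” entries. The function names and the toy responses are hypothetical and chosen only for the example.

```python
def build_label_matrix(responses):
    """Build the M x N label matrix L from (worker, image, label) triples.

    Unlabeled worker-image pairs are stored as None (the "?" entries).
    Returns (L, workers, images); workers[i] and images[j] name the row
    and column of L[i][j].
    """
    workers = sorted({w for w, _, _ in responses})
    images = sorted({img for _, img, _ in responses})
    row = {w: i for i, w in enumerate(workers)}
    col = {img: j for j, img in enumerate(images)}
    L = [[None] * len(images) for _ in workers]
    for w, img, label in responses:
        L[row[w]][col[img]] = label
    return L, workers, images

def observed_fraction(L):
    """Fraction of worker-image pairs that carry a label."""
    total = sum(len(r) for r in L)
    seen = sum(x is not None for r in L for x in r)
    return seen / total

# Toy data: three workers, three images, four labels.
responses = [("w1", "a", 1), ("w1", "b", 0), ("w2", "b", 1), ("w3", "c", 1)]
L, workers, images = build_label_matrix(responses)
density = observed_fraction(L)  # 4 of the 9 pairs are labeled
```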

We suppose there is a small number $D$ of unobserved factors that
influence the annotators’ labels. This reflects that their decisions
are driven by some mid-level visual cues. For example, when deciding
whether a shoe looks “ornate”, the latent factors might include
presence of buckles, amount of patterned textures, material type,
color, and heel height; when deciding whether a scene looks “modern”,
they might include color, object composition, and materials.

Assuming a linear factor model, the label matrix $L$ can be factored
as the product of an $M \times D$ annotator latent factor matrix
$A^T$ and a $D \times N$ image latent factor matrix $I$:

\[ L = A^T I. \tag{1} \]

A number of existing methods can be used to factor this partially
observed matrix, by finding the best rank-$D$ approximation under
some loss function (Salakhutdinov and Mnih 2007, 2008; Xiong et al
2010). We use a probabilistic matrix factorization algorithm (PMF)
from (Salakhutdinov and Mnih 2007, 2008), due to its efficiency for
large, sparse matrices. Briefly, it works as follows. PMF takes a
probabilistic approach to recover the two low-rank matrices. Let
$A_i$ and $I_j$ denote columns of $A$ and $I$, respectively, and let
$\ell_{ij} = 1$ if we received a label on image $j$ by annotator $i$,
and $\ell_{ij} = 0$ otherwise. The likelihood distribution for the
observed labels is

\[ p(L \mid A, I, \sigma^2) = \prod_{i=1}^{M} \prod_{j=1}^{N} \left[ \mathcal{N}(L_{ij} \mid A_i^T I_j, \sigma^2) \right]^{\ell_{ij}}, \tag{2} \]

where $\mathcal{N}(x \mid \mu, \sigma^2)$ denotes a Gaussian
distribution with mean $\mu$ and standard deviation $\sigma$. The
priors over the latent factors are spherical Gaussians (here
$\mathbf{I}$ denotes the $D \times D$ identity matrix):

\[ p(A \mid \sigma_A^2) = \prod_{i=1}^{M} \mathcal{N}(A_i \mid 0, \sigma_A^2 \mathbf{I}), \text{ and} \tag{3} \]

\[ p(I \mid \sigma_I^2) = \prod_{j=1}^{N} \mathcal{N}(I_j \mid 0, \sigma_I^2 \mathbf{I}). \tag{4} \]

We seek the latent features that maximize the log-posterior:

\[ A^*, I^* = \arg\max_{A, I} \, \ln p(A, I \mid L, \sigma^2, \sigma_A^2, \sigma_I^2). \tag{5} \]

Obtaining the MAP factors amounts to minimizing an SSD objective
function with quadratic regularization terms using gradient descent
(Salakhutdinov and Mnih 2007):

\[ E = \frac{1}{2} \sum_{i=1}^{M} \sum_{j=1}^{N} \ell_{ij} \left( L_{ij} - A_i^T I_j \right)^2 + \frac{\lambda_A}{2} \sum_{i=1}^{M} \| A_i \|^2 + \frac{\lambda_I}{2} \sum_{j=1}^{N} \| I_j \|^2, \tag{6} \]

where $\lambda_A = \sigma^2 / \sigma_A^2$ and
$\lambda_I = \sigma^2 / \sigma_I^2$, and we use the Frobenius norm.

This approach is a probabilistic extension of what would be standard
SVD in the case of fully observed labels. However, performance might
depend on careful tuning of parameters such as $\sigma^2$,
$\sigma_A^2$, $\sigma_I^2$. Upgrading to a full Bayesian treatment
(Salakhutdinov and Mnih 2008), we put priors on the user and image
hyperparameters. Let the mean and precision matrix of the user and
image prior distributions be denoted by $\mu_A$ and $\mu_I$, and
$\Lambda_A$ and $\Lambda_I$, respectively, and let
$\Theta_A = \{\mu_A, \Lambda_A\}$ and $\Theta_I = \{\mu_I, \Lambda_I\}$.
We place Gaussian-Wishart priors on these hyperparameters $\Theta_A$
and $\Theta_I$:

\[ p(\Theta_A \mid \Theta_0) = \mathcal{N}(\mu_A \mid \mu_0, (\beta_0 \Lambda_A)^{-1}) \, \mathcal{W}(\Lambda_A \mid W_0, \nu_0), \text{ and} \]

\[ p(\Theta_I \mid \Theta_0) = \mathcal{N}(\mu_I \mid \mu_0, (\beta_0 \Lambda_I)^{-1}) \, \mathcal{W}(\Lambda_I \mid W_0, \nu_0), \tag{7} \]

where $\Theta_0 = \{\mu_0, \nu_0, W_0\}$, $\mu_0 = 0$, $\beta_0 = 1$,
$\nu_0 = D$, and $W_0$ is the identity matrix.

Imputing $L_{ij}$ for some unknown labeling of user $i$ and image $j$
is then predicted via MCMC:

\[ p(L^*_{ij} \mid L, \Theta_0) \approx \frac{1}{R} \sum_{r=1}^{R} p(L^*_{ij} \mid A_i^{(r)}, I_j^{(r)}), \tag{8} \]

where the samples $\{A_i^{(r)}, I_j^{(r)}\}$ are generated in
parallel via Gibbs sampling as:

\[ A_i^{(r+1)} \sim p(A_i \mid L, I^{(r)}, \Theta_A^{(r)}), \text{ and} \]

\[ I_j^{(r+1)} \sim p(I_j \mid L, A^{(r+1)}, \Theta_I^{(r)}). \tag{9} \]

We obtain our estimates of $A$ and $I$ by averaging the $R$ samples
for each.

This Bayesian treatment reduces overfitting and saves parameter
tuning. See (Salakhutdinov and Mnih 2008) for details.
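
To make the factorization concrete, the following sketch minimizes the MAP objective of Eq. (6) by plain batch gradient descent over the observed entries only. It is a simplified stand-in, not the full Bayesian treatment with Gibbs sampling described above; the function name `pmf_map`, the regularization weights, the step size, the number of epochs, and the toy label matrix are all illustrative assumptions.

```python
import random

def pmf_map(L, D=2, lam_a=0.05, lam_i=0.05, lr=0.05, epochs=500, seed=0):
    """MAP estimate of the factors in L ~ A^T I by gradient descent on Eq. (6).

    L is a list of rows with entries 0, 1, or None (unlabeled).
    A[i] and Im[j] hold the D-dimensional latent vectors A_i and I_j.
    Returns (A, Im, losses), with the objective value recorded per epoch.
    """
    rng = random.Random(seed)
    M, N = len(L), len(L[0])
    A = [[rng.gauss(0, 0.1) for _ in range(D)] for _ in range(M)]
    Im = [[rng.gauss(0, 0.1) for _ in range(D)] for _ in range(N)]
    obs = [(i, j, L[i][j]) for i in range(M) for j in range(N)
           if L[i][j] is not None]
    losses = []
    for _ in range(epochs):
        # Gradient and loss contributions of the quadratic regularizers.
        gA = [[lam_a * a for a in A[i]] for i in range(M)]
        gI = [[lam_i * v for v in Im[j]] for j in range(N)]
        loss = 0.5 * lam_a * sum(a * a for r in A for a in r)
        loss += 0.5 * lam_i * sum(v * v for r in Im for v in r)
        # Data term: only observed entries (l_ij = 1) contribute.
        for i, j, y in obs:
            err = y - sum(a * v for a, v in zip(A[i], Im[j]))
            loss += 0.5 * err * err
            for d in range(D):
                gA[i][d] -= err * Im[j][d]
                gI[j][d] -= err * A[i][d]
        losses.append(loss)
        for i in range(M):
            for d in range(D):
                A[i][d] -= lr * gA[i][d]
        for j in range(N):
            for d in range(D):
                Im[j][d] -= lr * gI[j][d]
    return A, Im, losses

# Toy 4 x 4 label matrix with missing entries (None plays the role of "?").
L = [
    [1, 1, 0, None],
    [1, None, 0, 0],
    [None, 0, 1, 1],
    [0, 0, None, 1],
]
A, Im, losses = pmf_map(L)

def predict(i, j):
    """Reconstructed score A_i^T I_j for annotator i and image j."""
    return sum(a * v for a, v in zip(A[i], Im[j]))
```

The rows of `A` returned here play the role of the per-annotator latent vectors $A_i$; a sampling-based implementation would instead average such vectors over the $R$ Gibbs samples.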


3.3 Discovering Shades of Meaning 

In collaborative filtering, the goal of the factorization described
above is to impute missing labels (e.g., to predict how a user will
rate an unseen movie, $L_{ij} \approx A_i^T I_j$). While missing
labels could similarly be estimated for our data, our goal is
different. We aim to discover attribute shades of interpretation and
generate predictive visual models for them.

To this end, we first represent each annotator in terms of his
association with each discovered factor. The “latent feature vector”
for annotator $i$ is $A_i \in \mathbb{R}^D$, the $i$-th column of
$A$. It represents how much each of the $D$ factors influences that
annotator when he decides if the named attribute is present.
Likewise, the latent feature for image $j$ is $I_j \in \mathbb{R}^D$,
the $j$-th column of $I$, and represents how much each of the $D$
factors is visible in the image.

Figure 2 illustrates with a cartoon example. As seen on the left,
annotators did not label all images for the attribute “open”. Some
tended to label images 1 and 2 as having the attribute, whereas
others tended to label 3 and 4 as positive. After factoring the label
matrix, suppose we discover $D = 2$ latent factors. Though nameless,
they align with semantic visual cues; suppose here they are “toe is
open” and “heel is open”. Each annotator’s feature $A_i$ encodes how
important those two factors were for his label decision. In this
hypothetical example, we see the first three annotators labeled
images 1 and 2 as open due to factor 1, whereas the others focused on
factor 2 in other images.

Fig. 2 Given a partially observed attribute-specific label matrix (left), we recover its latent factors and their influence on each annotator (middle).
We discover shades by clustering in this space (dotted lines in center and images on right).

We pose shade discovery as a grouping problem in the space of these
latent features.^3 While various clustering algorithms could be used,
we apply K-means to the columns of $A$ to obtain clusters
$\{S_1, \ldots, S_K\}$.^4 Each cluster is a shade. Annotators in the
same cluster display similar labeling behavior, meaning they
interpret similar combinations of mid-level visual cues as salient
for the attribute at hand. For example, in Figure 2, the two dominant
shades reflect which part of the shoe the annotator focused on to
judge openness: toe or heel. (Of course, for real data, there will be
$D > 2$ factors, and shades will combine many such factors.)
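
The grouping step can be sketched with a small implementation of Lloyd's K-means applied to annotator latent vectors. The six two-dimensional vectors below are invented to mirror the cartoon “toe is open” / “heel is open” example; they are not values from the data, and the initialization scheme (centers drawn from the points) is one common choice rather than a detail specified here.

```python
import random

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(points, K, iters=100, seed=0):
    """Lloyd's K-means. Returns (assignments, centers)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, K)]
    assign = None
    for _ in range(iters):
        new_assign = [
            min(range(K), key=lambda k: sq_dist(p, centers[k]))
            for p in points
        ]
        if new_assign == assign:  # converged: assignments stopped changing
            break
        assign = new_assign
        for k in range(K):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:  # keep the old center if a cluster empties
                centers[k] = [sum(c) / len(members) for c in zip(*members)]
    return assign, centers

# Hypothetical annotator latent vectors A_i with D = 2 factors.
annotators = [
    [0.9, 0.1], [1.0, 0.0], [0.8, 0.2],  # rely mostly on factor 1 (toe)
    [0.1, 0.9], [0.0, 1.0], [0.2, 0.8],  # rely mostly on factor 2 (heel)
]
assign, centers = kmeans(annotators, K=2)
# Each cluster index in `assign` identifies one school of thought (shade).
```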

Recall that shade discovery is done on a per-attribute
basis. Depending on the visual precision of the word, some
attributes may have only one shade; others may have many.
To automatically select K based on the structure of the data,
we use a variant of the silhouette coefficient (Rousseeuw
1987). It quantifies the quality of a clustering by measuring
how tightly grouped the latent features in a cluster are,
normalized by how far they are from other clusters. More
specifically, let a_i be the average Euclidean distance of a
cluster member i to its neighbors (members of the same
cluster), and let b_i denote the mean distance of i to other
clusters, where the distance to each cluster is measured as
an average over distances to the cluster members. Then let:


\[ s_i = \frac{b_i - a_i}{\max(a_i, b_i)}. \]

^ Though we can cluster either annotators or images to identify
shades, we choose annotators in order to facilitate the mapping of users
to shades when building predictive models for the shades, as described
in Section 3.4.

^ Preliminary tests with Bayesian non-parametric clustering showed 
inferior results. An alternative would be to impute missing labels and 
group with EM, but clustering in the compact latent space is preferable 
when labels are very sparse. 


The silhouette coefficient is computed as the mean of the
values s_i.

As discussed above, by using automatic model selection, 
our approach is free to decide that an input word is already 
visually precise, not requiring many shades. 
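The model selection step above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `silhouette_variant` and `select_k` are hypothetical, points are plain tuples of latent features, cluster labelings are assumed to be computed elsewhere (e.g., by K-means), and a singleton cluster is given s_i = 0, a convention the text does not specify.

```python
import math
from collections import defaultdict


def silhouette_variant(points, labels):
    """Mean of s_i = (b_i - a_i) / max(a_i, b_i) over all points.

    a_i: mean Euclidean distance from point i to the other members of its cluster.
    b_i: mean, over all *other* clusters, of the mean distance from i to that
         cluster's members (the variant described in the text).
    """
    clusters = defaultdict(list)
    for idx, lab in enumerate(labels):
        clusters[lab].append(idx)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            # Assumption: a singleton cluster contributes s_i = 0.
            scores.append(0.0)
            continue
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        other_means = [
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        ]
        b = sum(other_means) / len(other_means)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def select_k(points, labelings):
    """Pick the K whose labeling has the highest coefficient.

    labelings: dict mapping a candidate K to a list of cluster labels.
    """
    return max(labelings, key=lambda k: silhouette_variant(points, labelings[k]))


# Toy data: three well-separated groups of annotator features.
toy_points = [(0, 0), (0, 1), (10, 0), (10, 1), (20, 0), (20, 1)]
toy_labelings = {2: [0, 0, 0, 0, 1, 1], 3: [0, 0, 1, 1, 2, 2]}
best_k = select_k(toy_points, toy_labelings)
```

On the toy data the three-cluster labeling scores higher than the two-cluster one, so it is the one selected.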


3.4 Using Shades to Predict Perceived Attributes 


A key application of shades is to improve attribute
prediction accuracy, generalizing what the system discovered
to novel images.

Any method leveraging the descriptive nature of attributes
needs to rely on attribute models that match a human user’s
perception. For example, an image search system that allows
attribute queries (et al 2011; Scheirer et al 2012; Rastegari
et al 2013; Vaquero et al 2009; Kovashka et al 2012) will
frustrate a user if the system’s notion of “formal” does not
match the user’s notion. Similarly, a zero-shot object
recognition system that trains a new object model based on its
attribute specification will fail unless it correctly interprets
the visual meaning intended by the human teacher.

Prior work uses one of two extremes for attribute prediction:
either (1) a consensus classifier: a single generic model trained
with examples whose labels are obtained through a majority
vote over multiple redundant crowd responses (e.g., (Kumar
et al 2011; Lampert et al 2009; Farhadi et al 2009; Vaquero
et al 2009; Patterson and Hays 2012)), or (2) a user-specific
classifier, which is trained by adapting that majority vote
model to satisfy an individual user’s training labels
(Kovashka and Grauman 2013). In the latter approach,
we collect between 12 and 40 labels per attribute from each
user, and apply them to learn a user-specific attribute model,
which we regularize with the parameters of the generic model.

Shades offer an approach in between these two extremes.
With shades, we can account for the fact that people
perceive an attribute differently, yet avoid specializing
predictions down to the level of each individual user. The idea is
to tailor an attribute classifier according to the user’s “school
of thought”, i.e., the shade to which he subscribes.

Fig. 3 We learn predictive models for shades, by adapting a standard
consensus model trained from any users in the crowd towards particular
schools of users.

To exploit the existence of schools of thought, we train
shade-specific classifiers that adapt the consensus model.
See Figure 3. Each shade S_k is represented by the total pool
of images that its annotators labeled as positive. Several
annotators in the cluster may have labeled the same image,
and their labels need not agree. Thus, we perform majority
vote over just the annotators in S_k to decide whether an
image is positive or negative for the shade. This majority vote
is a form of quality control, where we assume consistency
within the group. For both the shade models and the
consensus model, we discard labels where fewer than 90% of users
agree.
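The intra-shade vote can be sketched as follows. This is an illustrative reading rather than the authors' code: the function name `shade_training_labels`, the triple-based input format, and the choice to measure the 90% agreement among the shade's own annotators for each image are all assumptions made for the example.

```python
from collections import defaultdict


def shade_training_labels(labels, shade_members, min_agreement=0.9):
    """Majority-vote the labels given by one shade's annotators.

    labels: iterable of (annotator_id, image_id, is_positive) triples.
    shade_members: set of annotator ids belonging to the shade S_k.
    Returns {image_id: bool}, keeping only images on which at least
    `min_agreement` of the shade's voters agree with the majority.
    """
    votes = defaultdict(list)
    for annotator, image, is_positive in labels:
        if annotator in shade_members:  # ignore annotators outside the shade
            votes[image].append(bool(is_positive))

    result = {}
    for image, image_votes in votes.items():
        positives = sum(image_votes)
        # Ties count as positive here; with a 0.9 threshold they are discarded anyway.
        majority_is_positive = positives * 2 >= len(image_votes)
        agreeing = positives if majority_is_positive else len(image_votes) - positives
        if agreeing / len(image_votes) >= min_agreement:
            result[image] = majority_is_positive
    return result


example_labels = [
    ("a", "img1", 1), ("b", "img1", 1), ("c", "img1", 0),  # 2 of 3 agree: discarded at 0.9
    ("a", "img2", 0), ("b", "img2", 0),                    # unanimous negative: kept
    ("d", "img1", 0),                                      # "d" is not in the shade
]
shade_labels = shade_training_labels(example_labels, {"a", "b", "c"})
```

The surviving image-label pairs are what a shade-specific classifier would then be trained on.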

We use the resulting image-label pairs to train a
discriminative classifier, using the adaptive support vector machine
(SVM) objective of Yang et al (2007) to regularize its
parameters to be similar to those of the consensus model. In
other words, we are now personalizing to schools of users,
as opposed to individual users. See Figure 3 for an overview
of this procedure. Then we apply the adapted shade model
for the cluster to which a user belongs to predict the
presence/absence of the attribute in novel images. Thus, the
predictions are automatically tailored to that user’s perception
of the property.
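To make the adaptation idea concrete, here is a minimal sketch of one way to regularize a linear classifier toward a consensus weight vector. It assumes a linear model without a bias term and minimizes 0.5 * ||w - w0||^2 + C * sum of hinge losses by plain subgradient descent; the function name `adapt_linear_svm`, the optimizer, and the hyperparameters are illustrative choices, not the solver of Yang et al (2007) nor the authors' code.

```python
def adapt_linear_svm(w0, X, y, C=1.0, lr=0.01, epochs=200):
    """Adapt consensus weights w0 to shade-specific labeled data.

    Minimizes 0.5 * ||w - w0||^2 + C * sum_i max(0, 1 - y_i * <w, x_i>)
    with full-batch subgradient descent (an assumption made for brevity).
    X: list of feature tuples; y: list of labels in {-1, +1}.
    """
    w = list(w0)
    for _ in range(epochs):
        # Gradient of the regularizer pulls w back toward the consensus model.
        grad = [wi - w0i for wi, w0i in zip(w, w0)]
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            if margin < 1:  # hinge loss active: subgradient is -C * y_i * x_i
                for j, xj in enumerate(xi):
                    grad[j] -= C * yi * xj
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


# The consensus model only uses feature 0; the shade's labels depend on feature 1.
consensus = [1.0, 0.0]
adapted = adapt_linear_svm(consensus, [(0.0, 1.0), (0.0, -1.0)], [1, -1])
```

With no shade data the function returns the consensus weights unchanged; with shade data it moves only as far from them as the labels require, which is the behavior the regularizer is meant to produce.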

To recap, shades offer an important midpoint on the
spectrum discussed above. Compared to the standard consensus
approach, we account for distinct perceived shades.
Compared to user-adaptive models, the advantages are twofold.
First, each model typically leverages more training data than
a single user provides. This lets us effectively “borrow”
labeled instances from the user’s neighbors in the crowd.
Second, we leverage the robustness of the intra-shade majority
vote. This helps reduce noise in an individual user’s
labeling. The results in Section 4.2 reveal the impact of these
advantages in practice.

Note, a user must provide at least some attribute labels to
benefit from the shade models, since we need to know which
shade to apply. For users who contributed to the label matrix
L, this is straightforward. For users adding labels later, we
could either re-factor L, or more efficiently, use a
folding-in heuristic (Deerwester et al 1990; Hofmann 1999) (not
attempted in our experiments).


3.5 Discussion 


The key thing to note about the shade classifiers is how
their positive labeled exemplars came about. Images within
a shade can be visually diverse from the point of view of
typical global image descriptors, since annotators attuned to
that shade’s latent factors could have focused on arbitrarily
small parts of the images, or arbitrary subsets of feature
modalities (color, shape, texture). For example, one shade
for “open” might focus on shoe toes, while another focuses
on shoe heels. Similarly, one shade for “formal” capturing
the notion that dark-colored shoes are formal would rely on
color alone, while another capturing the notion that shoes
with excessively high heels are not formal would rely on
shape alone. An approach that attempts to discover shades
based on image clustering, or on non-semantic attribute
discovery such as (Parikh and Grauman 2011a; Mahajan et al
2011; Duan et al 2012; Rastegari et al 2012; Sharmanska
et al 2012; Yu et al 2013), will be hard pressed to group
images according to these perceived, possibly subtle, cues.
Our insight is to leverage patterns among the crowd labels
to partition the images semantically. Then, even though the
training images may be visually diverse, standard
discriminative learning methods let us isolate the informative
features. Essentially, we avoid biasing the shades to a
particular low-level descriptor space, since their training images
are determined independent of the descriptors.

One might wonder: why not just manually enumerate
the attribute shades with words? Our approach has multiple
advantages over that strategy, beyond being automatic. For
polysemous nouns, the visual definitions are enumerable;
one could simply check the dictionary. In contrast, it can be
difficult to put an attribute’s distinct visual instantiations in
words, e.g., by automatically generating all possible
qualifiers for an attribute. This would amount to automatically
listing all possible contexts in which an object can occur,
all possible shapes a human body can take, etc. Neither can
we rely solely on mining the textual explanations gathered
from users to qualify attributes. We find that the words
annotators typically provide to explain their interpretation of
an attribute are concrete instances of the shade, which need
not comprehensively define the shade. For example, in our
data collection, when asked to explain why an image is
“ornamented”, an annotator might comment on the “buckle”
or “bow”; yet the latent shade of “ornamented” underlying
many users’ labels is more abstract. It encompasses
combinations of such concrete mid-level cues. In short, we find
that people are good at naming examples, but less good at
characterizing an entire shade in words. Our method fills that
gap, using structure in the labels to identify shades.

Attribute     Present?   Explanation

Ornate        No         "Ornate means decorated with extra items not inherent in the
                         making of the object. This boot has a camo print as part of the
                         object, but no additional items put on it."

Ornate        Yes        "The flowerprint pattern is unorthodox for a rubber boot and really
                         stands out against the jet black background."

Open area     Yes        "This is an enclosed area, but the room is very large and the ceiling
                         is very high, giving a lot of room. I think that this makes it an
                         enclosed area that is also an open area."

Open area     No         "I do not consider the image to show an open area because the
                         area shown is enclosed by walls. It is a larger space on the interior
                         of the building so it does have some aspects of an open space."

Comfortable   Yes        "The heel is shorter and looks more sturdy with the thickness of
                         the heel which would make it more comfortable then your typical
                         heel."

Formal        Yes        "I believe the formal aspect of this should is the color and design of
                         the the fabrics on this shoe. I felt this shoe would be used by a
                         person who wanted to be formal yet comfortable."

Fig. 4 Example label explanations that annotators provided. Bold is our emphasis. In the first two rows, notice that the same type of shoe (one with
patterns) can be perceived to have a different level of ornamentation, depending on whether the annotator believes patterns constitute
ornamentation. Further, a room with large spaces (rows 3 and 4) can be perceived as an open area or not, depending on whether the annotator believes an area
enclosed by walls can be considered open. Finally, in the last two rows we see two interesting examples of a high-heeled shoe (which is normally
labeled as uncomfortable) considered comfortable due to its sturdy heel, and a sneaker-like shoe seen as formal due to its color and design. Also
notice how well thought out these user responses are, which indicates that the quality of data we collected is high.

Shades require no additional labeling effort compared to
the existing user-specific approach (Kovashka and Grauman
2013); we utilize data the user has not labeled but neighbors have
labeled, thus reducing the manual annotation effort. In terms
of computational complexity, the only added cost compared
to the method of (Kovashka and Grauman 2013) is running
the Bayesian PMF method, which requires about 21 minutes
per attribute (see Section 4.1). Therefore, our shade
formation approach offers numerous advantages over alternative
approaches, for only a small complexity overhead.


4 Experimental Validation

We first demonstrate shades’ key utility for improving
attribute prediction (Section 4.2) and attribute-based image
search (Section 4.3). We then quantitatively analyze the
purity of the discovered shades (Section 4.4). We offer
comparisons to existing techniques, including both standard
consensus attributes as well as state-of-the-art methods for
attribute discovery (Rastegari et al 2012) and personalized
attributes (Kovashka and Grauman 2013). We analyze shades
qualitatively (Section 4.5) to visualize what is discovered.
Finally, we show how to transfer shades between attributes
and users in order to predict how a user will interpret an
attribute for which he has provided no labels (Section 4.6).
Attribute    Shades      Standard    User-exclusive  User-adaptive                Attribute discovery     Image clusters
                                                     (Kovashka and Grauman 2013)  (Rastegari et al 2012)

Pointy       76.3 (0.3)  74.0 (0.4)  67.8 (0.2)      74.8 (0.3)                   74.5 (0.4)              74.3 (0.4)
Open         74.6 (0.4)  66.5 (0.5)  65.8 (0.2)      71.6 (0.3)                   68.5 (0.4)              68.3 (0.4)
Ornate       62.8 (0.7)  56.4 (1.1)  59.6 (0.5)      61.1 (0.6)                   58.3 (0.8)              58.6 (0.7)
Comfortable  77.3 (0.6)  75.0 (0.7)  68.7 (0.5)      75.5 (0.6)                   76.0 (0.7)              75.4 (0.6)
Formal       78.8 (0.5)  76.2 (0.7)  69.6 (0.4)      77.1 (0.4)                   77.4 (0.6)              77.0 (0.6)
Brown        70.9 (1.0)  69.5 (1.2)  61.9 (0.5)      68.5 (0.9)                   69.3 (1.2)              69.8 (1.2)
Fashionable  62.2 (0.9)  58.5 (1.4)  60.5 (1.3)      62.0 (1.4)                   61.2 (1.4)              61.5 (1.1)
Cluttered    64.5 (0.3)  60.5 (0.5)  58.8 (0.2)      63.1 (0.4)                   60.4 (0.7)              60.8 (0.7)
Soothing     62.5 (0.4)  61.0 (0.5)  55.2 (0.2)      61.5 (0.4)                   61.1 (0.4)              61.0 (0.5)
Open area    64.6 (0.6)  62.9 (1.0)  57.9 (0.4)      63.5 (0.5)                   63.5 (0.8)              62.8 (0.9)
Modern       57.3 (0.8)  51.2 (0.9)  56.2 (0.7)      56.2 (1.1)                   52.5 (0.9)              52.0 (1.1)
Rustic       67.4 (0.6)  66.7 (0.5)  63.4 (0.5)      67.0 (0.5)                   67.2 (0.5)              67.2 (0.5)

Table 2 Accuracy of predicting perceived attributes, with standard error in parentheses. Our shades provide robust models that capture personalized 
notions of the attributes, yet do not overfit to possible noise in a user’s labels. 


4.1 Implementation Details 

We use image descriptors provided with the SUN and Shoes
datasets for all methods: concatenated GIST and color
histograms for Shoes, and GIST, color, HOG, and self-similarity
histograms for SUN. See (Kovashka et al 2012; Patterson
and Hays 2012) for details. The datasets can be accessed at
http://vision.cs.utexas.edu/whittlesearch/
and http://cs.brown.edu/~gen/sunattributes.html,
respectively. We use the Bayesian Probabilistic Matrix
Factorization (BPMF) implementation of Xiong et al
(2010). We fix D = 50,^ then use the default parameter
settings. For N = 1000 and M = 195, MCMC with 500
samples takes about 21 minutes. We cross-validate all classifier
parameters. We set K automatically per attribute based on
the optimal silhouette coefficient within K = {2, ..., 15}.
Typically values of K ≈ 7 are chosen by the algorithm. We
evaluate all 12 attributes listed in Section 3.1.

As noted in Section 3.1, during data collection
annotators must explain their attribute labels. Specifically, we ask,
“Please explain your response. What part or aspect of the
image do you associate with the attribute [attribute name]?
What part or aspect of the image led you to say that the
attribute [attribute name] is present or not present?” Figure 4
shows a sample of annotators’ responses. We draw on their
explanations below to aid our quantitative evaluation, but
they are never seen by our method.


4.2 Accuracy of Perceived Shade Predictions 

We first demonstrate how well shades capture perceived
attributes. We apply the shades as described in Section 3.4 to
predict user-specific labels. We compare to five methods:

1. Standard, which is the standard consensus approach
used in (Ferrari and Zisserman 2007; Kumar et al 2011;
Lampert et al 2009; Farhadi et al 2009; Vaquero et al
2009; Branson et al 2010; Wang and Mori 2010;
Patterson and Hays 2012);

2. User-exclusive, which trains one attribute classifier
per user using only his labeled images;

3. User-adaptive, a transfer method (Kovashka and Grauman
2013) that adapts the majority vote model with the
same user-specific labeled data as User-exclusive;

4. Attribute discovery, an alternative shade formation
method that clusters images in the space of
non-semantic attributes. These attributes are splits in the
feature space that are discriminative for object categories,
and we find them with the state-of-the-art method of
Rastegari et al (2012);^

5. and Image clusters, an additional alternative shade
formation method inspired by prior work for discovering
word “senses” (e.g., Loeff et al (2006)) that clusters the
image descriptors for all images labeled positive by at
least one annotator.

For the last two baselines, in order to map an image
cluster to ground truth descriptions, we look at the bag of images
each annotator labeled as positive, find the image cluster to
which the largest portion of the bag belongs, and assign it to
be this user’s shade ID.

All methods use linear SVMs for consistency with
Kovashka and Grauman (2013). Our method selects K
automatically per attribute, yielding values between 5 and 10.
We run 30 trials, sampling 20% of the available labels to
obtain on average 10 labels per user (representing what a user
might reasonably contribute to train the system).

^ See Figure 6 for an experiment on the sensitivity of our method to
the choice of D.

^ We use the code kindly provided by the authors; we train it with the
10 Shoes and 611 SUN categories in the training images used by our
method. We also tried using the method of Rastegari et al (2012) with
the semantic attributes as “categories”, but it performed significantly
worse.
[Figure 5 panels: “Pointy: avg accuracy 0.28 (chance 0.14)” and “Cluttered: avg accuracy 0.26 (chance 0.20)”; horizontal axis: predicted shade.]

Fig. 5 Accuracy of perceived shade predictions: Confusion matrices
for multi-way shade classification, for the attributes “pointy” and
“cluttered”.

Table 2 shows the results. Our shade discovery method
outperforms all other methods. It is more reliable than
Standard, which is the status quo attribute learning approach.
For “open”, we achieve an 8 point gain over Standard
and User-exclusive, which indicates both how different
user perceptions of this attribute are, as well as how useful
it is to rely on schools rather than individual users. Shades
also outperform the User-adaptive approach, while
requiring the exact same labeling effort. While that method
learns personalized models, shades leverage common
perceptions and thereby avoid overfitting to a user’s few labeled
instances. For example, on “brown”, User-adaptive
actually decreases the accuracy of Standard, which shows
that personalizing to individuals can be overkill as not
every user has a unique perception. Rather, there are
multiple shades of the attribute, and a user subscribes to some
shade, hence Shades’ superior performance. Shades also
outperform the two alternative shade formation baselines,
Attribute discovery and Image clusters. This shows
that our approach for forming shades produces the
highest-quality clusters, which are most aligned with true user
groupings based on the data provided, compared to other more
“obvious” baselines.

While Table 2 measures binary attribute classification,
our method can also perform multi-way shade classification.
For this result, we cluster in the latent feature space of the
images I_j, and again automatically select K. Figure 5 shows
representative resulting confusion matrices for the attributes
“pointy” and “cluttered”. Our average multi-way accuracy
over all attributes is 0.28, much better than chance (0.15 on
average). This result indicates the discovered shades per
attribute are indeed distinct and detectable.

These results demonstrate the utility of shades. For all
attributes, mapping a person’s use of an attribute to a shade
allows us to predict attribute presence more accurately. This
is achieved at no additional expense for each user. As a
result, applications demanding descriptive attributes (e.g.,
image search, zero-shot learning, etc.) can benefit from the
more accurate representation.

Finally, we study the impact of the number of latent
factors D on the accuracy of attribute prediction with shades.
In general, we can expect higher values of D to enable better
accuracy, whereas lower values of D allow faster
computation. We run BPMF with D in the range (10, 100) in
increments of 20, i.e., D = 10, 30, 50, 70, 90. In Figure 6 we plot
attribute accuracy as a function of D, with varying values
for K (as the choice of K might depend on the choice of D).
This figure shows an average over all attributes and 10 runs
per attribute. For 10 of the 12 attributes, the difference
between accuracy scores is no more than 1% depending on the
choice of D, hence the small variance in the averaged plot.
Therefore, we conclude that our approach is not very
sensitive to the choice of D.

[Figure 6 plot: horizontal axis: number of clusters (5 to 15); one curve each for D = 10, 30, 50, 70, 90.]

Fig. 6 Variance of shades’ performance as a function of the number of
latent factors D.


4.3 Personalized Image Search with Shades 


Next we examine how the accurate perceived attribute
models offered by shades can positively impact an image search
application.

First, we collect additional data for the Shoes attributes
in Table 2 such that the same images are labeled for all
attributes, and all users label all attributes.^ This is necessary
since in the data collection described in Section 3.1 many
users only labeled a single attribute, so we have very few
cases of multiple attributes labeled by the same user for the
same image. We ask each of 200 users to label 40 images for
each attribute, out of a total set of 200 images that receive
labels from any user. We use 50 images total for training, 75
for testing, and 75 for cross-validation. We repeat the shade
formation and shade-based attribute prediction procedure as
in Section 3.4 using the training data from each user.

We then pose multi-attribute queries with the test
images. For each test image and user, we generate all q-tuples
of the attributes with labels from the user. Each of these
tuples forms a multi-attribute query composed of q attributes
that a user might issue during search, e.g., “I want to buy
ornate and formal shoes.” We use the user’s labels as the
ground truth for these queries, and examine the presence/absence
predictions of the Standard, User-exclusive,
User-adaptive, and Shades approaches on each q-attribute query.
To quantify retrieval accuracy, we measure the fraction of
these query images where the user’s ground truth labels and
a model’s predictions agree on all q attributes per query.

^ We omit the attribute “brown” since it only appears in a small set
of images.

q   Shades      Standard    User-exclusive  User-adaptive                Chance
                                            (Kovashka and Grauman 2013)

2   53.3 (0.1)  50.1 (0.1)  43.3 (0.1)      50.9 (0.1)                   25
3   39.8 (0.1)  36.3 (0.1)  29.4 (0.1)      37.4 (0.1)                   12.5
4   29.7 (0.2)  26.5 (0.2)  20.1 (0.2)      27.9 (0.2)                   6.2
5   21.8 (0.5)  18.8 (0.5)  14.0 (0.4)      20.7 (0.5)                   3.1
6   17.1 (1.8)  12.9 (1.6)  11.7 (1.6)      16.4 (1.8)                   1.5

Table 3 Multi-attribute query image search accuracy using shades, with standard error in parentheses. q is the number of attributes in the query.

Table 3 shows the results, for q = {2, ..., 6}. Our shades
approach produces higher match rates, hence more
accurate image search results, than any of the baselines,
consistent with our result in Section 4.2. For q = 2, our method
achieves a 6% relative gain over Standard, and 5% gain
over User-adaptive. This demonstrates that in order for
attribute-based searches to be successful, the retrieval
system needs to interpret the user’s attribute queries correctly;
shades allow the learning of robust models which are
personalized yet do not overfit to noise in a user’s labels.

Note that chance performance corresponds to the
probability of randomly matching all q attribute ground truth
labels. All methods show a decrease in accuracy as more query
words are used, since it becomes more difficult for a method
to correctly predict the presence of an increasing number of
attributes.
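The match-rate metric and the chance baseline can be written down directly. The sketch below is illustrative only: the function names are hypothetical, and the chance formula assumes each binary attribute is guessed independently with probability 1/2, which reproduces the Chance column of Table 3 up to rounding.

```python
def match_rate(ground_truth, predictions):
    """Fraction of queries on which predictions agree with the user's labels
    on every one of the q attributes.

    ground_truth, predictions: lists of equal length; each entry is a tuple of
    q booleans (attribute present / absent) for one query image.
    """
    hits = sum(1 for gt, pred in zip(ground_truth, predictions) if tuple(gt) == tuple(pred))
    return hits / len(ground_truth)


def chance_rate(q):
    """Probability of matching all q binary labels by independent fair guesses."""
    return 0.5 ** q


def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`."""
    return (ours - baseline) / baseline


# Two queries with q = 2; the second prediction gets one attribute wrong.
truth = [(True, False), (True, True)]
preds = [(True, False), (True, False)]
rate = match_rate(truth, preds)

# Relative gains for q = 2, using the match rates reported in Table 3.
gain_vs_standard = relative_gain(53.3, 50.1)
gain_vs_adaptive = relative_gain(53.3, 50.9)
```

Under these assumptions the two relative gains come out at roughly 6% and 5%, matching the figures quoted in the text.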

Figure 7 shows a qualitative search result. We rank the
subset of database images for which we have user labels
based on how many of the requested attributes they are
predicted to have, for both the Standard approach and our
Shades approach. We also show a subset of the user’s
labels as well as the majority-voted labels for the same
image, which helps explain the result. For the first query, notice
how our method ranks the red stiletto shoe (outlined in red)
compared to the baseline. Our method observes the user’s
idea that shoes with very high heels are neither “formal” nor
“pointy” (first column of user labels). Further, even though
the user agreed with the crowd on the “formalness” of the
sandal shoe outlined in purple, he rated other open shoes as
“not formal”, so our shades model correctly learned that
sandals should be ranked low given a query for “formalness”.

For the second query example, notice that even though the
user agreed with the crowd regarding the “formalness” of
the shoe outlined in green, he labeled other similar-looking
shoes as “not formal”. Our shades model captures this trend
rather than overfitting to an individual user label, and ranks
the green-outlined image low.

[Figure 8 panels, attribute “open area”: one pooled document labeled “Coherent (low entropy)” and one labeled “Not coherent (high entropy)”. Example pooled documents: “inside building outside open area wide open ceiling air sky wall room window unconfined enclosed ...” and “enclosed large room high ceiling wide open space large enclosed space open space wide are plenty room ...”.]

Fig. 8 Illustration of our cluster coherency evaluation. Top: We pool
together the label explanations from each user in a school, and then
examine the distribution over topics for each per-school document.
Bottom: A coherent document is one that focuses on just a few topics (e.g.,
“open areas” which are inside, in this case) as opposed to many topics
(e.g., both inside and outside “open areas”).


4.4 Quantifying the Accuracy of Shade Formation 

To further quantify how accurately our shades capture
perceived interpretations, we next score how coherent the
textual explanations (cf. Figure 4) are among annotators in the
same shade. In particular, we quantify how coherent the
label explanations are when we pool the text from all users
within a given shade. See Figure 8. Whereas random
clusters would group diverse ground truth explanations together,
good shades should align with coherent explanations. We
stress that these explanations are never seen by our
algorithm; they are for evaluation purposes only.

[Figure 7 panels. Query 1: "I want pointy, formal shoes." Query 2: "I want comfortable, formal shoes." Each panel shows a sample of the user's labels next to the crowd's labels (e.g., "User: not formal / Crowd: formal"), the Standard method's ranking, and the Shades' ranking.]

Fig. 7 Qualitative result of image search using shade models as opposed to standard attribute learning models. Our shades retrieve results which
more accurately capture the user’s notion of the attributes, without overfitting to individual labels. See the text for more details.

To measure coherency, we use a text analysis metric for
topic entropy (Hall et al 2008). We first perform
probabilistic Latent Semantic Analysis (pLSA) (Hofmann 1999) on
the Porter-stemmed textual descriptions. We treat each
description for which L_ij = 1 as a document and discover
T = 200 topics with pLSA. Then we map each
explanation to its distribution of topics (a vector of T weights). This
representation accounts for word meaning, not just word
occurrences (e.g., “image” and “picture” will be treated as
synonyms by pLSA). Let Q^k denote the matrix whose columns
are the T × 1 topic representation vectors for each of the V
positive explanations corresponding to users in shade S_k.
We define the representation of topics in this shade as
\[ q^k = \frac{1}{V} \sum_{v=1}^{V} Q^k_v, \]
where the index v denotes a column of the
matrix. Then we compute the overall topic entropy for this
shade as
\[ E_k = -\sum_{t=1}^{T} q^k_t \log q^k_t. \]
Low entropy is better, as it
indicates the shade corresponds to a more coherent set of
descriptions focused around a few topics.
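The entropy computation itself is short. The following sketch assumes the per-explanation topic distributions have already been obtained (e.g., from pLSA) and that the natural logarithm is used; the function name `shade_topic_entropy` and the log base are assumptions, since the text does not specify them.

```python
import math


def shade_topic_entropy(topic_vectors):
    """Entropy of the mean topic distribution of one shade.

    topic_vectors: list of V topic distributions (each a list of T weights
    summing to 1), one per positive explanation from annotators in the shade.
    """
    num_vectors = len(topic_vectors)
    num_topics = len(topic_vectors[0])
    # Average the V per-explanation distributions into one T-dimensional vector.
    mean = [sum(vec[t] for vec in topic_vectors) / num_vectors for t in range(num_topics)]
    # Terms with zero probability contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in mean if p > 0)


# A shade whose explanations all use one topic, versus one spread over four topics.
focused = shade_topic_entropy([[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]])
mixed = shade_topic_entropy([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])
```

The focused shade has zero entropy, while the shade whose explanations spread evenly over four topics reaches the maximum for four topics, log 4.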

We compare Shades to two methods defined above in
Section 4.2:

1. Attribute discovery: the state-of-the-art non-semantic
attribute discovery method of Rastegari et al (2012); and

2. Image clusters: an image clustering approach inspired
by Loeff et al (2006).

These baselines represent how one might reasonably
attempt to perform shade formation with existing techniques.

Note that all methods use K-means and remove clusters
with fewer than 10 members, which tend to be too sparse to
form a meaningful shade.

Figure 9 shows the results. We plot topic entropy (and
standard error) as a function of the number of shades K,
over all attributes and 30 runs. Our shades are much more
coherent overall. Clearly, image clustering falls short. The
non-semantic attribute discovery method of Rastegari et al
(2012), while stronger than clustering, does not capture the
shades of meaning since it lacks human input on the attribute
interpretation. When K = 2, the baselines have lower
entropy than our shades, showing that image clustering suffices
to find very coarse groups; however, these
clusters are too coarse according to the silhouette coefficient
model selection, which selects K = 5 to K = 10 shades
as the optimal setting. This shows the shades we have
discovered are meaningful and accurately capture the varied
attribute meanings that users employ.

[Figure 9 plot: panel title “All 12 attributes”.]

Fig. 9 Quality of discovered attribute shades (low entropy indicates a
more coherent shade/cluster).

We now give some more information to help gauge the
significance of these results. Our method achieves entropy
which is about 0.2 lower than the entropy of the baseline
methods. In Table 4 we show some pairs of individual
descriptions whose topic distribution entropies differ by
about 0.2.^ Again, lower entropy denotes a more focused
explanation. In Table 4, the first explanation for “open”
includes many unrelated details, while the second
predominantly discusses the foot being seen. Similarly, a
high-quality user cluster will correspond to explanations that
focus on a single topic or a few topics. The second explanation
for “ornate” focuses on color, and hence achieves lower entropy.
The second explanation for “open area” focuses on the words
“room” and “space”. Just like the second explanation in each
pair, the clusters that our method obtains are more focused.

Attribute | Entropy | Explanation
Open | 2.85 | “This shoe is open across the top of the foot, with a space between the ankle strap and the toe. It also has gaps along the sides of the toe.”
Open | 2.62 | “Open represents that amount of foot that can be seen when the show is worn. The opening on this shoe allows for a portion of the upper foot to be seen.”
Ornate | 2.45 | “I consider the shoe in Image 45 to be ornate (made in an intricate shape or decorated with complex patterns) because it is oddly shaped, with a pattern and added strapping and it has a zipper pull that stands out.”
Ornate | 2.27 | “I associate the pattern of the shoe with the attribute ornate. It is the way that the plaid is mixed in, its color, and the mixing of the color in the shoe laces as well that led me to say that the attribute ornate is present.”
Open area | 2.41 | “You can see the sky and even though the photo is of a building there is plenty of open space surrounding it as well as the photography being taken outside.”
Open area | 2.23 | “Inside the net there is plenty of space, and room between the nets. There’s not too much room, but enough to be considered an open area. It’s also outside so out of the nets is plenty of room.”

Table 4 Pairs of annotation explanations with corresponding topic
entropy. Bold is our emphasis. Notice how lower entropy corresponds to
a more focused description (second example in each attribute). Similarly,
our shades method produces more focused clusters. See the text for an
explanation.
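To make the entropy comparison concrete, the following minimal sketch computes the Shannon entropy of a topic distribution. The function name `topic_entropy`, the two example distributions, and the use of the natural logarithm are assumptions made for illustration; the text does not specify the logarithm base or the number of topics.

```python
import math

def topic_entropy(topic_probs):
    """Shannon entropy (in nats) of a discrete topic distribution."""
    total = sum(topic_probs)
    if total <= 0:
        raise ValueError("distribution must have positive mass")
    entropy = 0.0
    for p in topic_probs:
        q = p / total            # normalize so the probabilities sum to 1
        if q > 0:                # the term 0 * log(0) is taken to be 0
            entropy -= q * math.log(q)
    return entropy

# Hypothetical distributions over four topics (not data from the paper).
focused = [0.85, 0.05, 0.05, 0.05]   # mass concentrated on one topic
diffuse = [0.25, 0.25, 0.25, 0.25]   # mass spread evenly over all topics

print(topic_entropy(focused), topic_entropy(diffuse))
```

An explanation, or a cluster of explanations, that concentrates on few topics yields the lower value, which is the sense in which lower entropy means "more focused" here.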


4.5 Visualizing Attribute Shades of Meaning 


Next, we provide qualitative results. Figure 10 visualizes
two shades each for eight of the attributes. The images are


^ Note that Figure [^captures entropies of distributions over a num¬ 
ber of descriptions, which are naturally higher than the topic entropy 
of a single description. 



Fig. 11 Image regions highlighted according to the importance of the
localized features for learning the shades. Our method finds those
localized visual properties that determine whether a shaded attribute is
present or not.


those most frequently labeled as positive by annotators in a 
shade $S_k$. The (stemmed) words are those that appear most
frequently in the annotator explanations (cf. Figure for 
that shade, after we remove words that overlap between the 
two. Font size reflects relative frequency. To aid readability, 
we also outline words that stand out as good representatives 
of the shade. Recall that the text annotations are not used by 
our approach during shade discovery. 

We see the shades capture nuanced visual sub-definitions
of the attribute words. For example, for the attribute “brown”,
one shade covers chocolate-colored shoes (top shade), while
another is lighter and more gold (bottom shade). For
“ornate”, one shade focuses on straps/buckles (top), while
another focuses on texture/print/patterns (bottom). For
“comfortable”, one shade emphasizes a low arch (top), while the
other requires soft materials (bottom). For “pointy”, one
focuses on the front of the shoe (bottom), while another
focuses on heels/bases that are “slightly” pointy. For “open”,
one shade includes open-heeled shoes, while another includes
sandals which are open at the front and back. In SUN, the
“open areas” attribute can be either outside (top) or inside
(bottom). For “soothing”, one shade emphasizes scenes
conducive to relaxing activities, while another focuses on the
aesthetics of the scene.

As discussed above, an important feature of our method
is its ability to perform discovery independent of a particular
image descriptor. To illustrate this, we next use the shades’
visual classifiers to examine their most informative localized
features. We use L1 regularization when training
one-vs.-rest logistic regression classifiers for each shade, in order to
isolate a sparse set of features most discriminative for that
shade. For each 70 x 70 grid cell of the image, we sum the
magnitude of the classifier weights for its features. Then we
multiply those weights with the pixel intensities in order to
visualize the relative impact of each portion of the image.
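The aggregation step just described can be sketched as follows. This is only an illustration under stated assumptions: the L1-regularized classifier training is omitted, the weight vector and the feature-to-cell mapping are hypothetical toy values, and the helper names `cell_importance` and `modulate_image` are not from the paper.

```python
def cell_importance(weights, feature_cells, grid_rows, grid_cols):
    """Sum the absolute classifier weights of the features in each grid cell."""
    importance = [[0.0] * grid_cols for _ in range(grid_rows)]
    for w, (r, c) in zip(weights, feature_cells):
        importance[r][c] += abs(w)
    return importance

def modulate_image(image, importance, cell_size):
    """Scale every pixel intensity by the importance of the cell it lies in."""
    out = []
    for y, row in enumerate(image):
        out.append([pix * importance[y // cell_size][x // cell_size]
                    for x, pix in enumerate(row)])
    return out

# Toy setup (assumed): a 4x4 image split into 2x2-pixel cells, i.e. a 2x2 grid.
weights = [0.0, -1.5, 0.5, 0.0, 2.0]                       # sparse, as L1 encourages
feature_cells = [(0, 0), (0, 1), (0, 1), (1, 0), (1, 1)]   # cell of each feature
importance = cell_importance(weights, feature_cells, 2, 2)

image = [[1.0] * 4 for _ in range(4)]                      # uniform gray image
heatmap = modulate_image(image, importance, cell_size=2)
```

Cells whose features all receive zero weight go dark, so the visualization shows only the regions the sparse classifier relies on.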

Figure 11 shows example results. Brighter cells indicate
regions more discriminative for that shade. For “open”, we
see one shade emphasizes openness at the back, and another
openness at the toe. For “formal”, the top shade emphasizes
the arch of the shoe, while the bottom one emphasizes the
toes. Such examples illustrate how our method isolates
visual properties that support a shade, yet would not be tightly
grouped if simply clustering global descriptors.

Fig. 10 Top words and images for two shades per attribute (top and bottom for each attribute). Best viewed on PDF or in color. Notice the subtle
differences in the annotator notions of the attributes exemplified by both the images considered positive for each shade, as well as the most frequent
words in the corresponding textual explanations. See the text for a more detailed description.

Of course, learning discriminative spatially localized
features is nothing new; our point is that shades are what
enable the training image groups that make this discriminative
selection feasible. Furthermore, recent work using crowds
to isolate informative spatial regions (Donahue and Grauman
2011; Deng et al. 2013) has a different purpose
(fine-grained image classification) and takes an entirely different
approach (explicitly asking labelers to outline the regions
needed to make their label decisions).


4.6 Exploiting Attribute Correlations for Cross-Attribute 
Transfer 

So far, we have discovered the shades of each attribute
disjointly from other attributes. However, the attributes that we
use are not completely independent. For example, there is
notable correlation between the attributes “fashionable” and
“formal”. We propose to exploit these correlations to
predict how a user will perceive an attribute for which they have
not supplied any labeled examples, by transferring labels for
this attribute from other users, and from other attributes
labeled by the same user.

As mentioned in Section 3.3, matrix factorization can
also be used to “fill in” missing values in the (user, image)
label space. The value of an entry $L_{ij}$ can be computed as
the inner product of the latent factor vectors of user $A_i$
and image $I_j$.

However, this label imputation can also exploit multiple
(user, image) label matrices together, if we stack these
matrices in a tensor. In this case, the label matrix L becomes
an M × N × Z label tensor, where Z denotes the number
of attributes being considered at once. We can decompose L
as:

$$ L = \sum_{d=1}^{D} A_{d,:} \circ I_{d,:} \circ T_{d,:}, \qquad (11) $$

where the index $d,:$ refers to the $d$-th row of each matrix and
$\circ$ denotes the vector outer product. T is the D × Z matrix of latent
factors for each of the Z attributes. We use the Bayesian
tensor factorization of Xiong et al. (2010) for this formulation,
which essentially extends the probabilistic matrix
factorization approach of Salakhutdinov and Mnih discussed above
to handle tensor data.
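As a minimal sketch of how a single tensor entry would be read off once the factors are known, the code below evaluates the sum over the D latent factors of the product of the user, image, and attribute factors. The factor values are hand-set toy numbers, not factors learned with Bayesian tensor factorization, and the function names `impute_entry` and `impute_tensor` are assumptions for illustration.

```python
def impute_entry(A, I, T, i, j, z):
    """Predicted label for user i, image j, attribute z:
    the sum over latent factors d of A[d][i] * I[d][j] * T[d][z]."""
    return sum(A[d][i] * I[d][j] * T[d][z] for d in range(len(A)))

def impute_tensor(A, I, T):
    """Fill the whole M x N x Z tensor from D x M, D x N and D x Z factors."""
    M, N, Z = len(A[0]), len(I[0]), len(T[0])
    return [[[impute_entry(A, I, T, i, j, z) for z in range(Z)]
             for j in range(N)]
            for i in range(M)]

# Toy factors (assumed): D = 2 factors, M = 2 users, N = 3 images, Z = 2 attributes.
A = [[1.0, 0.0],
     [0.0, 1.0]]            # user 0 loads on factor 0, user 1 on factor 1
I = [[1.0, -1.0, 0.5],
     [-1.0, 1.0, 0.5]]
T = [[1.0, 1.0],
     [1.0, -1.0]]           # attribute 1 flips the sign of factor 1

L_hat = impute_tensor(A, I, T)
```

With a single attribute whose factors are all ones, the same expression reduces to the inner product used for the (user, image) matrix case.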


Dataset | Ours | Chance
Shoes | 0.831 (0.001) | 0.50
SUN | 0.770 (0.001) | 0.50

Table 5 Accuracy of imputing missing labels using other attributes,
with standard error in parentheses. Utilizing attribute correlations
allows us to accurately predict how a user will perceive a novel attribute,
without having received any annotations for this attribute from this
user.


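An entry $L_{ijz}$ denotes how user i labeled image j for
attribute z. The likelihood of the observed labels then becomes

$$ p(L \mid A, I, T, \sigma^2) = \prod_{i=1}^{M} \prod_{j=1}^{N} \prod_{z=1}^{Z} \left[ \mathcal{N}\left( L_{ijz} \mid \langle A_i, I_j, T_z \rangle, \sigma^2 \right) \right]^{\delta_{ijz}} \qquad (12) $$

where $A_i$ and $I_j$ denote columns of A and I as before, $T_z$
denotes a column of T,
$\langle A_i, I_j, T_z \rangle = \sum_{d=1}^{D} A_{di} I_{dj} T_{dz}$,
$\delta_{ijz}$ indicates whether user i labeled image j for attribute z,
and we model the prior over the latent factors in T as a spherical
Gaussian, similar to A and I.
See the Bayesian Probabilistic Tensor Factorization (BPTF)
approach of Xiong et al. (2010) for more details.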

Using this tensor label imputation approach, we can
complete a transfer learning task of predicting how a user who
has never labeled an attribute will perceive this attribute,
by relying on this user having labeled other attributes, and
other users having labeled this attribute.

Table 5 shows the results. For Shoes, we use the new data
collected in Section 4.3, as it ensures all users have labeled
all attributes, while for SUN we lack such data and use the
data collected in Section 3.1. We achieve a much higher
accuracy than chance performance at 50%, thus showing that
one can successfully transfer knowledge about one attribute
to another.
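One way such an accuracy could be measured is sketched below: a user's labels for one attribute are held out, imputed from the factorization, and compared with the true labels. The binary labels in {+1, -1}, the threshold at zero, the function name `transfer_accuracy`, and all numbers are assumptions for illustration, not the paper's protocol or data.

```python
def transfer_accuracy(predicted, held_out):
    """Fraction of held-out labels whose sign matches the imputed value."""
    correct = 0
    for key, true_label in held_out.items():
        guess = 1 if predicted[key] >= 0 else -1   # assumed threshold at zero
        correct += (guess == true_label)
    return correct / len(held_out)

# Hypothetical imputed values and true labels, keyed by (user, image),
# for one attribute that this user never labeled during training.
predicted = {(0, 0): 0.8, (0, 1): -0.3, (0, 2): 0.1, (0, 3): -0.9}
held_out = {(0, 0): 1, (0, 1): -1, (0, 2): -1, (0, 3): -1}

accuracy = transfer_accuracy(predicted, held_out)   # 3 of 4 correct
```

With balanced binary labels, a predictor that ignores the other attributes would be expected to score about 0.5 under this measure, which is the chance level reported in Table 5.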


5 Conclusion 

Our work addresses the gap between how people describe
attributes and how they perceive them visually. We show
how to discover people’s shared biases in perception, then
exploit them with visual classifiers that can generalize to
new images. The proposed approach to discover attribute
shades brings together language, crowdsourcing, human
perception, and visual representations in a new way.

The learned shades successfully tailor attribute
predictions to cater to a user’s “school of thought”, boosting the
accuracy of detecting perceived attributes. In systematic
experiments, we quantify the impact of shades, compared
to both standard paradigms and multiple state-of-the-art methods.
We demonstrate that for image search applications, it is
crucial to build robust personalized models that account for a
user’s biases. The visualized shades show great promise to
separate the (sub-)attributes involved in a person’s use of an
attribute vocabulary during image search or organization of
image content.

It is plausible that shades originate in part due to cultural
differences that might be captured well by demographic
information, like a person’s location, age, etc. We conducted
a preliminary study to determine whether shades correlate
with demographics. We asked some annotators from the United
States to name their city of residence, and after performing
clustering in latent factor space, we mined for correlations
between the clusters found and the annotators’ geographic
locations. However, clusters in the latent factor space did not
produce obvious clusters in geographic space. This suggests
that shades are more subtle than what is captured within
demographic parameters alone. This problem merits further
exploration, including by extending the range of the study
to countries other than the US.

In future work, we will investigate ways to predict a
person’s preferred shade based on a minimal set of label
requests. We would also like to further explore the semantic
relationships between the attributes, to determine how
transfer across attributes might help learn shade models more
efficiently. Additionally, we would like to study approaches
for automatically determining the degree of ambiguity in an
attribute term from the attribute’s textual definition, possibly
with the addition of a small number of image exemplars.
Finally, it would be intriguing to apply our approach to novel
tasks, such as discovering the common types of errors
annotators make (for purposes of illustration during training) and
examining ambiguity in descriptions of actions.